Probabilistic Data Generation for Deduplication and Data Linkage
نویسنده
چکیده
In many data mining projects the data to be analysed contains personal information, like names and addresses. Cleaning and preprocessing of such data likely involves deduplication or linkage with other data, which is often challenged by a lack of unique entity identifiers. In recent years there has been an increased research effort in data linkage and deduplication, mainly in the machine learning and database communities. Publicly available test data with known deduplication or linkage status is needed so that new linkage algorithms and techniques can be tested, evaluated and compared. However, publication of data containing personal information is normally impossible due to privacy and confidentiality issues. An alternative is to use artificially created data, which has the advantages that content and error rates can be controlled, and the deduplication or linkage status is known. Controlled experiments can be performed and replicated easily. In this paper we present a freely available data set generator capable of creating data sets containing names, addresses and other personal information.
منابع مشابه
Probabilistic Deduplication, Record Linkage and Geocoding
Outline Background and illustrative example Record linkage Applications, privacy and ethics Our project and our tools Data cleaning and standardisation Probabilistic data standardisation and HMMs Blocking / indexing Record pair classification Geocoding Outlook Peter Christen, May 2005 – p.2/28
متن کاملProbabilistic Linkage of Persian Record with Missing Data
Extended Abstract. When the comprehensive information about a topic is scattered among two or more data sets, using only one of those data sets would lead to information loss available in other data sets. Hence, it is necessary to integrate scattered information to a comprehensive unique data set. On the other hand, sometimes we are interested in recognition of duplications in a data set. The i...
متن کاملProbabilistic Record Linkage and Deduplication after Indexing, Blocking, and Filtering
Probabilistic record linkage, the task of merging two or more databases in the absence of a unique identifier, is a perennial and challenging problem. It is closely related to the problem of deduplicating a single database, which can be cast as linking a single database against itself. In both cases the number of possible links grows rapidly in the size of the databases under consideration, and...
متن کاملA Probabilistic Deduplication, Record Linkage and Geocoding System
In many data mining projects in the health sector information from multiple data sources needs to be cleaned, deduplicated and linked in order to allow more detailed analysis. The aim of such linkages is to merge all records relating to the same entity, such as a patient. Most of the time the linkage process is challenged by the lack of a common unique entity identifier. Additionally, personal ...
متن کامل